M01: Lecture Note 2

Language, Probability, and Generative Systems

Lecture covering Introduction to Generative AI & Business Applications.

Published: November 21, 2024
Modified: February 17, 2026

1 Text Analytics and Sentiment Analysis

1.1 Introduction to Text Analytics

Text analytics is the discipline concerned with extracting meaningful information, patterns, and insights from unstructured text data. In today’s digital world, vast amounts of information are generated in textual form — emails, social media posts, customer reviews, reports, and more. Text analytics provides the computational tools and methodologies to transform this raw text into structured knowledge that organizations can use for decision‑making, automation, and research.

The figure below provides a high‑level overview of the text mining landscape, illustrating how raw text is processed, analyzed, and converted into actionable insights.

Text Mining Overview
Figure 1. Text Analytics (Talib et al. 2016)

Text analytics draws from multiple fields — including information retrieval, machine learning, statistics, and linguistics — to build systems that can understand and interpret human language at scale. It is often used interchangeably with text mining, though the two terms have subtle differences that we will clarify shortly.

1.2 Text Mining Process

Text mining refers to the computational process of discovering patterns, extracting information, and generating structured representations from unstructured text. The workflow typically involves several stages, from lexical processing (working with characters and tokens) to structural and semantic interpretation (building syntax trees, extracting entities, and constructing knowledge bases).

The diagram below illustrates a typical NLP pipeline used in text mining systems:

Figure: a typical NLP pipeline for text mining. Characters are split into tokens, tokens are tagged (aided by regular expressions and a part‑of‑speech tagger), syntax trees are built from tokens and tagged tokens, entity relationships are extracted from the trees, and the results populate a knowledge base that feeds a logic compiler and an information extractor.

This pipeline highlights three major layers:

1.2.1 Lexical Processing

Lexical processing focuses on the surface form of text — characters, words, and tokens.
- Characters represent the raw textual input.
- Tokens are meaningful units such as words or punctuation marks.
- Tagged tokens include additional linguistic information such as part‑of‑speech tags.

This stage prepares the text for deeper syntactic and semantic analysis.
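A minimal sketch of this stage, from characters to tokens to tagged tokens (the tag lookup below is a toy, illustrative dictionary, not a trained POS tagger):

```python
# Minimal sketch of lexical processing: characters -> tokens -> tagged tokens.
import re

def tokenize(text: str) -> list[str]:
    # Keep words and punctuation as separate tokens
    return re.findall(r"\w+|[^\w\s]", text)

# Toy tag lookup -- illustrative only, not a trained POS tagger
TAG_LOOKUP = {"the": "DET", "cat": "NOUN", "sat": "VERB",
              "on": "ADP", "mat": "NOUN", ".": "PUNCT"}

def tag(tokens: list[str]) -> list[tuple[str, str]]:
    # Fall back to "X" for words outside the toy lexicon
    return [(t, TAG_LOOKUP.get(t.lower(), "X")) for t in tokens]

tokens = tokenize("The cat sat on the mat.")
print(tokens)       # ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
print(tag(tokens))  # [('The', 'DET'), ('cat', 'NOUN'), ...]
```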

1.2.2 Structural Representation

Once text is tokenized and annotated, the system constructs higher‑level structures:
- Syntax trees capture grammatical relationships between words.
- Entity relationships identify connections between named entities (e.g., “Apple acquired Beats”).
- Knowledge bases store structured facts extracted from text.

These representations enable downstream tasks such as information extraction and reasoning.

1.2.3 Algorithmic Components

The pipeline integrates algorithmic modules such as:
- Regular expressions for pattern matching
- Part‑of‑speech taggers
- Logic compilers for rule‑based reasoning
- Information extractors for identifying entities, events, and relations

These components interact with the structural layers to produce meaningful outputs.

1.3 Text Analytics: Definitions and Scope

Text analytics is a broader umbrella term that encompasses text mining as well as other analytical processes. It includes tasks such as information retrieval, summarization, classification, and visualization. The relationship between text mining and text analytics can be expressed mathematically as follows:

\begin{align} \text{Text Mining} &= \text{Information Extraction} + \text{Data Mining} + \text{Web Mining} \\ \text{Text Analytics} &= \text{Information Retrieval} + \text{Text Mining} \end{align}

In other words:
- Text mining focuses on extracting structured information from text.
- Text analytics includes text mining plus the broader ecosystem of tools used to search, retrieve, and analyze text at scale.

1.4 Application Areas of Text Mining

Text mining supports a wide range of applications across industries. Some of the most common include:

1.4.1 Information Extraction

Automatically identifying entities, relationships, and events from text.

1.4.2 Topic Tracking

Monitoring how topics evolve over time in news, social media, or research literature.

1.4.3 Summarization

Generating concise summaries of long documents, either extractively or abstractively.

1.4.4 Categorization

Assigning documents to predefined categories (e.g., spam detection, news classification).

1.4.5 Clustering

Grouping similar documents without predefined labels.

1.4.6 Concept Linking

Connecting related concepts across documents to reveal hidden associations.

1.4.7 Question Answering

Building systems that can answer natural‑language questions using textual data.

These applications demonstrate the versatility of text mining in both academic and commercial settings.

1.5 Text Mining and Analytics Pipeline

The following figure provides a general overview of the NLP pipeline that underlies most text mining and analytics systems:

General NLP Pipeline
Figure 2. NLP Pipeline

This pipeline typically includes:
- Text preprocessing
- Feature extraction
- Model training or rule‑based analysis
- Evaluation and deployment

Each stage builds upon the previous one to transform raw text into structured insights.

1.6 Sentiment Analysis

Sentiment analysis is one of the most widely used applications of text mining. It aims to determine the emotional tone or subjective opinion expressed in text. This is especially valuable in domains such as marketing, customer service, finance, and social media analytics.

The figure below illustrates the major tasks, tools, and methods used in sentiment analysis:

Sentiment Analysis Process
Figure 3. Sentiment Analysis: tasks, tools, methods, and applications

1.6.1 Methods

Sentiment analysis can be approached using several methodological families:
- Lexicon‑based methods, which rely on predefined sentiment dictionaries
- Machine learning methods, which learn patterns from labeled data
- Deep learning methods, which use neural networks to capture complex linguistic patterns
- Hybrid methods, which combine lexicons with machine learning for improved accuracy
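As a concrete illustration of the lexicon‑based family, here is a minimal scorer that sums signed weights from a tiny hand‑made dictionary (the words and weights below are invented for illustration; real systems use curated lexicons such as SentiWordNet or VADER):

```python
# Hedged sketch of a lexicon-based sentiment scorer.
# The dictionary below is a toy example, not a published lexicon.
LEXICON = {"good": 1.0, "great": 2.0, "love": 2.0,
           "bad": -1.0, "terrible": -2.0, "hate": -2.0}

def lexicon_score(text: str) -> float:
    # Sum the signed weights of every known word; unknown words score 0
    return sum(LEXICON.get(word, 0.0) for word in text.lower().split())

print(lexicon_score("a great movie but a terrible ending"))  # 0.0
print(lexicon_score("i love it"))                            # 2.0
```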

1.6.2 Applications

Sentiment analysis is used in:
- Domain‑specific applications such as product reviews or political analysis
- Large language model pipelines, where sentiment signals can guide downstream tasks

1.6.3 Challenges

Two major categories of challenges arise:
- Methodological challenges, such as handling sarcasm or domain adaptation
- Text context challenges, including ambiguity, negation, and cultural variation

1.7 Sentiment Classification Algorithms

The following DOT diagram summarizes the major families of sentiment classification algorithms:

Figure: taxonomy of sentiment classification algorithms. Machine learning approaches split into supervised learning (decision tree, linear, and rule‑based classifiers, leading to algorithms such as SVMs, neural networks, and deep learning) and unsupervised learning (probabilistic classifiers such as Naïve Bayes, Bayesian networks, and maximum entropy). Lexicon‑based approaches split into dictionary‑based and corpus‑based (statistical or semantic) methods.

This taxonomy divides sentiment analysis approaches into two broad categories:

1.7.1 Machine Learning Approaches

These include:
- Supervised learning, where models learn from labeled examples, using families such as:
  - Decision tree classifiers
  - Linear classifiers
  - Rule‑based classifiers
  - Probabilistic models
- Unsupervised learning, where models infer structure without labels

Specific algorithms include SVMs, neural networks, deep learning models, Naïve Bayes, Bayesian networks, and maximum entropy models.

1.7.2 Lexicon‑Based Approaches

These rely on sentiment dictionaries or corpus‑derived lexicons.
Two major subtypes include:
- Dictionary‑based approaches, which use curated word lists
- Corpus‑based approaches, which infer sentiment from statistical or semantic patterns in large corpora

1.8 Types of Sentiment Analysis

Sentiment analysis can be specialized into several sub‑tasks:

  • Aspect‑based sentiment analysis, which identifies sentiment toward specific product attributes
  • Emotion‑based analysis, which classifies emotions such as joy, anger, or fear
  • Fine‑grained sentiment analysis, which assigns sentiment scores on a multi‑point scale
  • Intent‑based analysis, which infers user intentions behind the text

The following figure illustrates these types:

Types of Sentiment Analysis

2 Web Mining, Personalization, and Social Analytics

2.1 Web Mining

Web mining refers to the application of data mining techniques to discover patterns, extract knowledge, and derive insights from the vast and heterogeneous resources available on the World Wide Web. As the web continues to grow exponentially, organizations increasingly rely on automated methods to understand user behavior, content structure, and emerging trends. Web mining bridges information retrieval, machine learning, and analytics to make sense of this large‑scale digital ecosystem.

At a high level, web mining can be divided into three major categories:

  • Web Content Mining — extracting information from the content of web pages
  • Web Structure Mining — analyzing the link structure of the web
  • Web Usage Mining — understanding user behavior through logs and interaction data

The following diagram illustrates a typical workflow for weblog mining, one of the most common forms of web usage mining:

Figure: weblog mining process — raw data is collected as weblog data, integrated, pre‑processed, and then patterns are extracted and analyzed to produce the final patterns.

2.1.1 Understanding the Web Mining Process

The process begins with data collection, where raw weblog data is gathered from servers, proxies, or client‑side scripts. This data often includes page requests, timestamps, user identifiers, and clickstream paths.

Next, the data undergoes pre‑processing, which is essential because raw logs are noisy and inconsistent. Pre‑processing includes cleaning, session identification, user identification, and integration of logs from multiple sources.

Once the data is prepared, pattern discovery techniques such as clustering, association rule mining, sequential pattern mining, or classification are applied. These methods reveal behavioral patterns, navigation paths, or correlations in user activity.

Finally, the discovered patterns are interpreted and transformed into actionable insights — for example, improving website design, personalizing content, or optimizing marketing strategies.
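The session‑identification step described above can be sketched as follows, starting a new session after 30 minutes of inactivity (the log format and the 30‑minute timeout are assumptions for illustration):

```python
# Sketch: reconstruct user sessions from raw (user, timestamp, page) log
# entries, starting a new session after 30 minutes of inactivity.
from datetime import datetime, timedelta

log = [
    ("u1", datetime(2024, 1, 1, 9, 0), "/home"),
    ("u1", datetime(2024, 1, 1, 9, 10), "/pricing"),
    ("u1", datetime(2024, 1, 1, 11, 0), "/home"),   # > 30 min gap -> new session
    ("u2", datetime(2024, 1, 1, 9, 5), "/blog"),
]

def sessionize(entries, timeout=timedelta(minutes=30)):
    sessions = {}   # user -> list of sessions, each a list of pages
    last_seen = {}  # user -> timestamp of the most recent request
    for user, ts, page in sorted(entries, key=lambda e: (e[0], e[1])):
        if user not in sessions or ts - last_seen[user] > timeout:
            sessions.setdefault(user, []).append([])
        sessions[user][-1].append(page)
        last_seen[user] = ts
    return sessions

print(sessionize(log))
# {'u1': [['/home', '/pricing'], ['/home']], 'u2': [['/blog']]}
```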

2.2 Web Content Mining (Web Scraping)

Web content mining focuses on extracting useful information from the content of web pages. This content may include text, images, metadata, HTML structure, or embedded multimedia. Because much of the web is unstructured, automated extraction — often referred to as web scraping — plays a central role.

Web scraping systems typically follow a pipeline that includes:

  • Crawling: systematically navigating web pages
  • Parsing: interpreting HTML or XML structure
  • Extraction: identifying and capturing relevant content
  • Transformation: converting extracted data into structured formats such as CSV, JSON, or databases

The figure below illustrates the general workflow of a web scraping system:

Web Scraping Process
Figure 4. Web Scraping Process

Web content mining is widely used in applications such as price monitoring, competitive intelligence, sentiment analysis, and large‑scale text analytics.
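The parsing and extraction stages can be sketched with only the standard library; here a small parser collects link targets from an inlined HTML snippet (a real scraper would fetch pages over HTTP and respect robots.txt):

```python
# Parsing/extraction sketch using only the standard library: collect the
# href targets of anchor tags from an inlined HTML snippet.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

html = '<p>See <a href="/docs">docs</a> and <a href="/blog">blog</a>.</p>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/docs', '/blog']
```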

2.3 Web Usage Mining (Web Analytics)

Web usage mining, often referred to as web analytics, focuses on analyzing user interactions with websites. Every visit, click, and navigation step generates data that can be used to understand user behavior and improve digital experiences.

Web usage mining typically involves:

  • Collecting clickstream data from server logs, cookies, or tracking scripts
  • Identifying user sessions to reconstruct navigation paths
  • Analyzing behavioral patterns such as frequent paths, drop‑off points, or conversion funnels
  • Supporting decision‑making in areas like personalization, recommendation systems, and UX design

The following figure illustrates the typical process of web usage mining:

Web Usage Mining Process

Web Mining Process

This form of mining is foundational to modern digital marketing, A/B testing, and user experience optimization.

2.4 Social Analytics

Social analytics focuses on understanding digital interactions and relationships across social platforms. As individuals, organizations, and communities increasingly communicate online, social analytics provides the tools to measure influence, detect trends, and interpret collective behavior.

A widely accepted definition describes social analytics as:

Monitoring, analyzing, measuring, and interpreting digital interactions and relationships of people, topics, ideas, and content.

Social analytics encompasses two major subfields:

  • Social Network Analysis (SNA) — studying the structure of relationships among individuals or entities
  • Social Media Analytics (SMA) — analyzing content, engagement, and trends on platforms such as Twitter, Instagram, Reddit, or LinkedIn

The diagram below captures this conceptual division:

Figure: social analytics divides into Social Network Analysis (SNA) and Social Media Analytics (SMA).

2.4.1 Social Network Analysis (SNA)

SNA examines how individuals or entities are connected. It uses graph theory to analyze nodes (people, organizations) and edges (relationships, interactions). Key metrics include centrality, density, modularity, and community structure. SNA is used in fields ranging from epidemiology to marketing and organizational behavior.
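One of these metrics, degree centrality, can be computed directly from an adjacency representation; the toy graph below is invented for illustration (libraries such as networkx provide the same metric):

```python
# Sketch of one SNA metric, degree centrality, on a toy undirected graph
# stored as an adjacency dict.
graph = {
    "alice": {"bob", "carol", "dave"},
    "bob": {"alice"},
    "carol": {"alice", "dave"},
    "dave": {"alice", "carol"},
}

def degree_centrality(g: dict[str, set]) -> dict[str, float]:
    # Degree divided by the maximum possible degree (n - 1)
    n = len(g)
    return {node: len(neighbors) / (n - 1) for node, neighbors in g.items()}

print(degree_centrality(graph))  # alice is the most central node (1.0)
```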

2.4.2 Social Media Analytics (SMA)

SMA focuses on the content and interactions occurring on social platforms. It includes sentiment analysis, trend detection, topic modeling, engagement measurement, and influencer identification. SMA helps organizations understand public opinion, track brand perception, and respond to emerging issues in real time.

2.5 Text Mining

Text mining refers to the process of extracting meaningful, high‑quality information from unstructured text. In modern organizations, an estimated 85–90% of corporate data exists in unstructured form—emails, documents, logs, social media, and more. This volume is growing rapidly, doubling roughly every 18 months. As a result, the ability to analyze text is no longer optional; it is essential for competitive advantage.

High‑quality information is typically obtained by identifying patterns, trends, and relationships within text. These patterns may be statistical, linguistic, or semantic, and they support downstream tasks such as classification, summarization, recommendation, and decision‑making.

Note

Text mining is the process of deriving high‑quality information from text, often through statistical pattern learning and structured extraction.

2.6 Knowledge Discovery from Web Data

Knowledge discovery from textual web data involves collecting, cleaning, structuring, and analyzing large volumes of online content. This process often includes crawling web pages, extracting relevant text, transforming it into structured representations, and applying analytical or machine learning techniques.

Knowledge Discovery from Web Data
Figure 5. Example Data exploration workflow for textual data extraction (Gupta et al. 2024)

3 Natural Language Processing (NLP)

Natural Language Processing (NLP) is the field that enables computers to process, understand, generate, and interact with human language. NLP systems bridge raw text and computational models, allowing machines to interpret meaning, perform tasks, and generate coherent language.

Key capabilities include:

  • Learning useful representations: encoding text into structured forms (e.g., embeddings) that capture meaning
  • Generating language: producing text for tasks such as translation, summarization, or dialogue
  • Connecting language and action: enabling systems to use language to perform tasks, reason, or interact with environments

4 General NLP Framework

At its core, NLP involves learning a function that maps an input $X$ to an output $Y$, where either or both involve language. The table below illustrates common NLP tasks:

| Input X | Output Y               | Task                |
|---------|------------------------|---------------------|
| Text    | Continuing Text        | Language Modeling   |
| Text    | Text in Other Language | Translation         |
| Text    | Label                  | Text Classification |
| Text    | Linguistic Structure   | Language Analysis   |
| Image   | Text                   | Image Captioning    |

This framework covers tasks such as language modeling, translation, classification, linguistic analysis, and multimodal tasks like image captioning.


5 Building NLP Systems

NLP systems can be built in several ways, ranging from rule‑based approaches to modern machine learning and prompting.

5.0.1 Rule‑based Systems

These rely on manually crafted rules:

def classify(x: str) -> str:
    sports_keywords = ["baseball", "soccer", "football", "tennis"]
    if any(keyword in x for keyword in sports_keywords):
        return "sports"
    else:
        return "other"

5.0.2 Prompting

Prompting uses a language model without training:

Figure: a prompt ("If the following sentence is about 'sports', reply 'sports'. Otherwise reply 'other'.") is sent to a language model (LM).

5.0.3 Fine‑tuning

Fine‑tuning trains a model on paired examples $\langle X, Y \rangle$:

| Sentence                        | Label  |
|---------------------------------|--------|
| "I love to play baseball."      | sports |
| "The stock price is going up."  | other  |
| "He got a hat-trick yesterday." | sports |
| "He is wearing tennis shoes."   | other  |

These labeled training samples are fed into a training procedure that produces the model.


6 Data Requirements for System Building

Different approaches require different amounts of data:

  • Rules or intuition‑based prompting: no data required
  • Spot‑check prompting: small samples of input $X$
  • Rigorous evaluation: development and test sets
  • Fine‑tuning: large labeled datasets; performance improves with scale

| Split            | Variables        | Share |
|------------------|------------------|-------|
| Train Data       | X_train, y_train | 60%   |
| Test Data        | X_test, y_test   | 20%   |
| Validation (Dev) | X_val, y_val     | 20%   |
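A reproducible version of this 60/20/20 split might look like the following (the data here is synthetic):

```python
# Reproducible 60/20/20 train/test/validation split matching the proportions
# above; the data is synthetic (100 integer "examples").
import random

data = list(range(100))
random.seed(0)          # fix the shuffle for reproducibility
random.shuffle(data)

n = len(data)
train = data[:int(0.6 * n)]
test = data[int(0.6 * n):int(0.8 * n)]
val = data[int(0.8 * n):]

print(len(train), len(test), len(val))  # 60 20 20
```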

7 Natural Language Processing Pipeline

The NLP pipeline transforms raw text into structured representations and downstream outputs. It typically includes data ingestion, parsing, cleaning, feature engineering, and consumption by models or analytics systems.

Figure: data pipeline — data sources (survey data, logs, APIs, databases, files, external feeds) pass through ingestion and parsing (load, parse, validate, normalize) and transformation (clean, aggregate, enrich, feature engineering) into data products (structured tables, feature store, analytics dataset), which are consumed by dashboards, ML models, and reports.

7.1 Text Summarization

Text summarization condenses long documents into concise, informative summaries. The process includes preprocessing, feature extraction, sentence ranking, and summary construction.

Figure: text summarization pipeline — (1) pre‑processing: lowercasing, punctuation removal, tokenization, stopword removal, sentence segmentation; (2) linguistic and statistical features: POS tagging, word segmentation, occurrence statistics (TF, TF‑IDF), keyword extraction, sentence‑level features; (3) sentence ranking: unsupervised methods (TextRank, LexRank), supervised ML models, or neural models (BART, T5, Pegasus); (4) summary construction: select top sentences, order and smooth, optional compression or paraphrasing.
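A minimal sketch of the extractive path through this process: score each sentence by the document‑wide frequency of its words and keep the top‑scoring ones (a simple frequency heuristic, not TextRank or a neural model; the document is invented):

```python
# Frequency-based extractive summarization: rank sentences by how often
# their words occur across the document, then keep the top-k in original order.
import re
from collections import Counter

def words(s: str) -> list[str]:
    return re.findall(r"\w+", s.lower())

def summarize(sentences: list[str], k: int = 1) -> list[str]:
    freq = Counter(w for s in sentences for w in words(s))
    top = sorted(sentences, key=lambda s: -sum(freq[w] for w in words(s)))[:k]
    return [s for s in sentences if s in top]  # preserve original order

doc = [
    "Text mining extracts patterns from text.",
    "The weather was pleasant.",
    "Mining text reveals patterns and trends in text data.",
]
print(summarize(doc))  # the third sentence scores highest
```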

7.2 Core NLP Tasks

NLP encompasses a wide range of tasks, including:

  • Part‑of‑speech tagging
  • Text segmentation
  • Word sense disambiguation
  • Handling syntactic ambiguity
  • Speech acts
  • Question answering
  • Summarization
  • Natural language generation and understanding
  • Machine translation
  • Speech recognition and text‑to‑speech
  • OCR
  • Text proofing

8 Classical vs. Deep Learning NLP


Classical vs. Deep Learning NLP

9 Sentiment Classification

Sentiment analysis determines whether text expresses positive, negative, or neutral sentiment.

9.1 Text Information

def read_xy_data(filename: str) -> tuple[list[str], list[int]]:
    x_data = []
    y_data = []
    with open(filename, 'r') as f:
        for line in f:
            label, text = line.strip().split(' ||| ')
            x_data.append(text)
            y_data.append(int(label))
    return x_data, y_data


x_train, y_train = read_xy_data('./data/sentiment-treebank/train.txt')
x_test, y_test = read_xy_data('./data/sentiment-treebank/dev.txt')


print("Document:-", x_train[0])
print("Label:-", y_train[0])
Document:- The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .
Label:- 1

9.2 Segmentation, Tokenization, and Cleaning

def extract_features(x: str) -> dict[str, float]:
    features = {}
    x_split = x.split(' ')

    # Count the number of "good words" and "bad words" in the text
    good_words = ['love', 'good', 'nice', 'great', 'enjoy', 'enjoyed']
    bad_words = ['hate', 'bad', 'terrible',
                 'disappointing', 'sad', 'lost', 'angry']
    for x_word in x_split:
        if x_word in good_words:
            features['good_word_count'] = features.get(
                'good_word_count', 0) + 1
        if x_word in bad_words:
            features['bad_word_count'] = features.get(
                'bad_word_count', 0) + 1

    # The "bias" value is always one, to allow us to assign a "default" score to the text
    features['bias'] = 1

    return features


feature_weights = {'good_word_count': 1.0, 'bad_word_count': -1.0, 'bias': 0.5}

9.3 Decision Algorithm

def run_classifier(x: str) -> int:
    score = 0
    for feat_name, feat_value in extract_features(x).items():
        score = score + feat_value * feature_weights.get(feat_name, 0)
    if score > 0:
        return 1
    elif score < 0:
        return -1
    else:
        return 0

def calculate_accuracy(x_data: list[str], y_data: list[int]) -> float:
    total_number = 0
    correct_number = 0
    for x, y in zip(x_data, y_data):
        y_pred = run_classifier(x)
        total_number += 1
        if y == y_pred:
            correct_number += 1
    return correct_number / float(total_number)

9.4 Results

label_count = {}
for y in y_test:
    if y not in label_count:
        label_count[y] = 0
    label_count[y] += 1
print(label_count)

train_accuracy = calculate_accuracy(x_train, y_train)
test_accuracy = calculate_accuracy(x_test, y_test)

print(f'Train accuracy: {train_accuracy}')
print(f'Dev/test accuracy: {test_accuracy}')

# Display 4 decimal
print(f'Train accuracy: {train_accuracy:.4f}')
print(f'Dev/test accuracy: {test_accuracy:.4f}')
{1: 444, 0: 229, -1: 428}
Train accuracy: 0.4345739700374532
Dev/test accuracy: 0.4214350590372389
Train accuracy: 0.4346
Dev/test accuracy: 0.4214

9.5 Model Evaluation

import random

def find_errors(x_data, y_data):
    error_ids = []
    y_preds = []
    for i, (x, y) in enumerate(zip(x_data, y_data)):
        y_preds.append(run_classifier(x))
        if y != y_preds[-1]:
            error_ids.append(i)
    for _ in range(5):
        my_id = random.choice(error_ids)
        x, y, y_pred = x_data[my_id], y_data[my_id], y_preds[my_id]
        print(f'{x}\ntrue label: {y}\npredicted label: {y_pred}\n')


find_errors(x_train, y_train)
`` Freaky Friday , '' it 's not .
true label: -1
predicted label: 1

-LRB- Screenwriter -RRB- Pimental took the Farrelly Brothers comedy and feminized it , but it is a rather poor imitation .
true label: 0
predicted label: 1

... this is n't even a movie we can enjoy as mild escapism ; it is one in which fear and frustration are provoked to intolerable levels .
true label: -1
predicted label: 1

The movie itself appears to be running on hypertime in reverse as the truly funny bits get further and further apart .
true label: -1
predicted label: 1

But it also comes with the laziness and arrogance of a thing that already knows it 's won .
true label: -1
predicted label: 1

9.6 Improving the Model

A typical improvement loop:

  1. Diagnose errors
  2. Modify features or scoring
  3. Measure improvements
  4. Iterate
  5. Evaluate on test data
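As an example of step 2, one common diagnosis in the error analysis above is mishandled negation; a hedged sketch of a negation‑aware feature extractor (the word lists are invented for illustration):

```python
# Sketch of one feature improvement: flip a sentiment word's polarity when it
# directly follows a negator. Word lists are invented for illustration.
NEGATORS = {"not", "n't", "never", "no"}
GOOD = {"good", "great", "love", "enjoy"}
BAD = {"bad", "terrible", "hate"}

def polarity_features(text: str) -> dict[str, int]:
    feats = {"good": 0, "bad": 0}
    words = text.lower().split()
    for i, w in enumerate(words):
        negated = i > 0 and words[i - 1] in NEGATORS
        if w in GOOD:
            feats["bad" if negated else "good"] += 1
        elif w in BAD:
            feats["good" if negated else "bad"] += 1
    return feats

print(polarity_features("not good at all"))  # {'good': 0, 'bad': 1}
```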

10 Linguistic Barriers

Challenges include:

  • Low‑frequency words
  • Conjugation
  • Negation
  • Metaphor
  • Analogy
  • Symbolic language

Tip

Consider how feature engineering or modern embeddings can address these issues.

11 Probabilistic Topic Modeling

Topic modeling uncovers latent themes in large text corpora.

screenshot of papers from Science magazine about topic modeling
Figure 6. Probabilistic Topic Modeling

11.1 Machine Learning Foundations

Machine learning aims to estimate a function $f(x)$ that predicts labels from text.
The function may be linear or nonlinear, hand‑crafted or learned from data.

Machine Learning end to end pipeline
Figure 7. Machine Learning

11.2 Bag of Words Approach

Bag of Words (BoW) represents text as unordered collections of word counts.

Bag of Words
Figure 8. Bag of Words

11.3 Why BoW Matters

  • Converts text into fixed‑length numeric vectors
  • Simple, interpretable, and effective for many tasks
  • Ignores word order but preserves frequency

11.4 Text Cleaning

Original: Despite suffering a sense-of-humour failure...
Cleaned: despite suffering a sense of humour failure...

import random

def sample_sentences(x, y, n=4, seed=42):
    random.seed(seed)
    idx = random.sample(range(len(x)), n)
    return [(y[i], x[i]) for i in idx]

samples = sample_sentences(x_train, y_train, n=4)

for i, (label, text) in enumerate(samples, 1):
    print(f"S{i} [label={label}]: {text}")
S1 [label=1]: With Dickens ' words and writer-director Douglas McGrath 's even-toned direction , a ripping good yarn is told .
S2 [label=0]: Maybe Thomas Wolfe was right : You ca n't go home again .
S3 [label=-1]: Despite suffering a sense-of-humour failure , The Man Who Wrote Rocky does not deserve to go down with a ship as leaky as this .
S4 [label=1]: It will guarantee to have you leaving the theater with a smile on your face .

Cleaning typically includes lowercasing, removing punctuation, and optional stopword removal.
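These steps can be sketched as a single function that reproduces the Original/Cleaned example above (the stopword list is a tiny illustrative sample):

```python
# Sketch of the cleaning steps: lowercase, replace punctuation with spaces
# (so hyphenated words split cleanly), and optionally drop stopwords.
import string

STOPWORDS = {"a", "the", "of", "to", "and", "is"}  # tiny illustrative sample

def clean(text: str, remove_stopwords: bool = False) -> str:
    text = "".join(ch if ch not in string.punctuation else " "
                   for ch in text.lower())
    tokens = text.split()
    if remove_stopwords:
        tokens = [w for w in tokens if w not in STOPWORDS]
    return " ".join(tokens)

print(clean("Despite suffering a sense-of-humour failure..."))
# despite suffering a sense of humour failure
```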

11.5 Tokenization and CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

docs = [text for _, text in samples]

vectorizer = CountVectorizer(
    lowercase=True,
    stop_words=None   # keep everything for teaching clarity
)

X = vectorizer.fit_transform(docs)

bow_df = pd.DataFrame(
    X.toarray(),
    columns=vectorizer.get_feature_names_out(),
    index=[f"S{i+1}" for i in range(len(docs))]
)

bow_df.iloc[:, 0:8]
    again  and  as  ca  deserve  despite  dickens  direction
S1      0    1   0   0        0        0        1          1
S2      1    0   0   1        0        0        0          0
S3      0    0   2   0        1        1        0          0
S4      0    0   0   0        0        0        0          0

11.6 Vocabulary, DTM, and Word Frequencies

BoW produces a document‑term matrix (DTM) where rows are documents and columns are word counts.
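Building a small DTM by hand makes this structure concrete; it is exactly what CountVectorizer automates (the documents below are invented):

```python
# Building a tiny document-term matrix by hand: rows are documents, columns
# are vocabulary words (sorted), cells are raw counts.
docs = ["the cat sat", "the dog sat", "the cat ran"]

vocab = sorted({w for d in docs for w in d.split()})
dtm = [[d.split().count(w) for w in vocab] for d in docs]

print(vocab)  # ['cat', 'dog', 'ran', 'sat', 'the']
for row in dtm:
    print(row)
```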

12 Strengths and Limitations of BoW

12.1 Strengths

  • Simple
  • Fast
  • Effective for short, structured text

12.2 Limitations

  • Ignores order and meaning
  • Sparse representations
  • Vocabulary explosion

12.3 BoW in Practice (Sentiment Analysis)

Figure: lexicon‑based sentiment workflow — Step 1: compute the O–S (objective–subjective) polarity of a statement using a corpus; if a sentiment is present, Step 2: compute its N–P (negative–positive) polarity using a lexicon; Step 3: identify the sentiment's target; record polarity, strength, and target; Step 4: tabulate and aggregate the sentiment analysis results.

12.4 When to Use Bag of Words

Use BoW when:

  • Data is small or medium
  • Interpretability matters
  • You need a fast baseline

Avoid BoW when:

  • Documents are long
  • Semantic nuance matters
  • Context is essential

12.5 Key Takeaways

  • Bag of Words is fundamentally about counting, not understanding
  • It is a stepping stone to TF‑IDF, embeddings, and transformers
  • Representation is the foundation of all NLP
  • Generative and discriminative models offer complementary perspectives

References

Gupta, Sonakshi, Akhlak Mahmood, Pranav Shetty, Aishat Adeboye, and Rampi Ramprasad. 2024. “Data Extraction from Polymer Literature Using Large Language Models.” Communications Materials 5 (December): 269. https://doi.org/10.1038/s43246-024-00708-9.
Talib, Ramzan, Muhammad Kashif Hanif, Shaeela Ayesha, and Fakeeha Fatima. 2016. “Text Mining: Techniques, Applications and Issues.” International Journal of Advanced Computer Science and Applications 7 (11): 414–18.